Fast Online Training with Frequency-Adaptive Learning Rates for Chinese Word Segmentation and New Word Detection
نویسندگان
چکیده
We present a joint model for Chinese word segmentation and new word detection. We present high dimensional new features, including word-based features and enriched edge (label-transition) features, for the joint modeling. As we know, training a word segmentation system on large-scale datasets is already costly. In our case, adding high dimensional new features will further slow down the training speed. To solve this problem, we propose a new training method, adaptive online gradient descent based on feature frequency information, for very fast online training of the parameters, even given large-scale datasets with high dimensional features. Compared with existing training methods, our training method is an order magnitude faster in terms of training time, and can achieve equal or even higher accuracies. The proposed fast training method is a general purpose optimization method, and it is not limited in the specific task discussed in this paper.
منابع مشابه
A Realistic and Robust Model for Chinese Word Segmentation
A realistic Chinese word segmentation tool must adapt to textual variations with minimal training input and yet robust enough to yield reliable segmentation result for all variants. Various lexicon-driven approaches to Chinese segmentation, e.g. [1,16], achieve high f-scores yet require massive training for any variation. Text-driven approach, e.g. [12], can be easily adapted for domain and gen...
متن کاملAdaptive Chinese Word Segmentation with Online Passive-Aggressive Algorithm
In this paper, we describe our system1 for CIPS-SIGHAN-2010 bake-off task of Chinese word segmentation, which focused on the cross-domain performance of Chinese word segmentation algorithms. We use the online passive-aggressive algorithm with domain invariant information for cross-domain Chinese word segmentation.
متن کاملAdaptive Chinese Word Segmentation
This paper presents a Chinese word segmentation system which can adapt to different domains and standards. We first present a statistical framework where domain-specific words are identified in a unified approach to word segmentation based on linear models. We explore several features and describe how to create training data by sampling. We then describe a transformation-based learning method u...
متن کاملUnknown Word Extraction for Chinese Documents
There is no blank to mark word boundaries in Chinese text. As a result, identifying words is difficult, because of segmentation ambiguities and occurrences of unknown words. Most previous works focus their attention only on the resolution of ambiguous segmentation. The problem of unknown word identification is considered more difficult and needs further investigation. Conventionally unknown wor...
متن کاملChinese Word Segmentation as Character Tagging
In this paper we report results of a supervised machine-learning approach to Chinese word segmentation. A maximum entropy tagger is trained on manually annotated data to automatically assign to Chinese characters, or hanzi, tags that indicate the position of a hanzi within a word. The tagged output is then converted into segmented text for evaluation. Preliminary results show that this approach...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2012